Regularized least squares learning with heavy-tailed noise is minimax optimal

Neural Information Processing Systems

This paper examines the performance of ridge regression in reproducing kernel Hilbert spaces when the noise has only a finite number of higher moments. We establish excess risk bounds consisting of subgaussian and polynomial terms based on the well-known integral operator framework. The dominant subgaussian component makes it possible to achieve convergence rates that have previously been derived only under subexponential noise, a prevalent assumption in related work from the last two decades. These rates are optimal under standard eigenvalue decay conditions, demonstrating the asymptotic robustness of regularized least squares against heavy-tailed noise. Our derivations are based on a Fuk-Nagaev inequality for Hilbert-space valued random variables.


Escaping saddle points without Lipschitz smoothness: the power of nonlinear preconditioning

Neural Information Processing Systems

We study generalized smoothness in nonconvex optimization, focusing on (L0, L1)-smoothness and anisotropic smoothness. The former was empirically derived from practical neural network training examples, while the latter arises naturally in the analysis of nonlinearly preconditioned gradient methods. We introduce a new sufficient condition that encompasses both notions, reveals their close connection, and holds in key applications such as phase retrieval and matrix factorization. Leveraging tools from dynamical systems theory, we then show that nonlinear preconditioning, including gradient clipping, preserves the saddle point avoidance property of classical gradient descent. Crucially, the assumptions required for this analysis are actually satisfied in these applications, unlike in classical results that rely on restrictive Lipschitz smoothness conditions. We further analyze a perturbed variant that efficiently attains second-order stationarity with only logarithmic dependence on dimension, matching similar guarantees of classical gradient methods.
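
As a rough illustration of why clipping helps when Lipschitz smoothness fails, the sketch below runs plain and clipped gradient descent on f(x) = x^4/4, whose curvature 3x^2 is unbounded. The objective, step size, and clipping threshold are assumptions chosen for the example and are not taken from the paper. The example shows only the stabilizing effect of clipping, not the saddle point avoidance result.

```python
def grad(x: float) -> float:
    """Gradient of the toy objective f(x) = x**4 / 4 (an assumed example)."""
    return x * x * x


def plain_gd(x0: float, lr: float, steps: int) -> float:
    """Classical gradient descent with a fixed step size."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def clipped_gd(x0: float, lr: float, steps: int, threshold: float = 1.0) -> float:
    """Gradient descent where the gradient is rescaled to norm <= threshold.

    Clipping acts as a nonlinear preconditioner: far from the minimizer the
    update has fixed length lr * threshold, and close to it the method
    reduces to plain gradient descent.
    """
    x = x0
    for _ in range(steps):
        g = grad(x)
        scale = min(1.0, threshold / abs(g)) if g != 0.0 else 1.0
        x -= lr * scale * g
    return x


if __name__ == "__main__":
    # Same start and step size; only the clipped run stays bounded.
    print("plain GD after 3 steps:     ", plain_gd(10.0, 0.1, 3))
    print("clipped GD after 1000 steps:", clipped_gd(10.0, 0.1, 1000))
```

From the same starting point and with the same step size, the plain iteration overshoots further on every step, while the clipped iteration moves toward the minimizer at 0 in bounded steps.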


A Near-Optimal Algorithm for Decentralized Convex-Concave Finite-Sum Minimax Optimization

Neural Information Processing Systems

In this paper, we study distributed convex-concave finite-sum minimax optimization over a network and propose a decentralized variance-reduced optimistic gradient method with stochastic mini-batch sizes (DIVERSE).


acb3e20075b0a2dfa3565f06681578e5-Paper-Conference.pdf

Neural Information Processing Systems

This paper investigates convex-concave minimax optimization problems where only function value access is allowed. We introduce a class of Hessian-aware quantum zeroth-order methods that can find an ε-saddle point within O(d^{2/3} ε^{-2/3}) function value oracle calls. This represents an improvement of d^{1/3} ε^{-1/3} over the O(d ε^{-1}) upper bound of classical zeroth-order methods, where d denotes the problem dimension. We extend these results to µ-strongly-convex µ-strongly-concave minimax problems using a restart strategy, and show a speedup of d^{1/3} µ^{-1/3} compared to classical zeroth-order methods. The acceleration achieved by our methods stems from the construction of efficient quantum estimators for the Hessian and the subsequent design of efficient Hessian-aware algorithms. In addition, we apply such ideas to non-convex optimization, leading to a reduction in the query complexity compared to classical methods.
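
The stated improvement factor can be checked by dividing the two query bounds quoted in the abstract. The short derivation below assumes the exponents on the accuracy parameter are negative, which is the only reading consistent with the quoted improvement factor, and it ignores constants and logarithmic factors. The numerical values of d and ε are purely illustrative.

```latex
% Ratio of the classical bound to the quantum bound (constants ignored).
\[
  \frac{d\,\epsilon^{-1}}{d^{2/3}\,\epsilon^{-2/3}}
  = d^{\,1 - 2/3}\,\epsilon^{\,-1 + 2/3}
  = d^{1/3}\,\epsilon^{-1/3}.
\]
% Illustrative values: d = 10^{3}, \epsilon = 10^{-3}.
\[
  d\,\epsilon^{-1} = 10^{6}, \qquad
  d^{2/3}\,\epsilon^{-2/3} = 10^{2}\cdot 10^{2} = 10^{4}, \qquad
  d^{1/3}\,\epsilon^{-1/3} = 10 \cdot 10 = 10^{2}.
\]
```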


Nearly Dimension-Independent Convergence of Mean-Field Black-Box Variational Inference

Neural Information Processing Systems

We prove that, given a mean-field location-scale variational family, black-box variational inference (BBVI) with the reparametrization gradient converges at a rate that is nearly free of explicit dimension dependence. Specifically, for a d-dimensional strongly log-concave and log-smooth target, the number of iterations for BBVI with a sub-Gaussian family to obtain a solution ϵ-close to the global optimum has an explicit dimension dependence no larger than O(log d). This is a significant improvement over the O(d) dependence of full-rank location-scale families. For heavy-tailed families, we prove a weaker O(d^{2/k}) dependence, where k is the number of finite moments of the family. Additionally, if the Hessian of the target log-density is constant, the complexity is free of any explicit dimension dependence. We also prove that our bound on the gradient variance, which is key to our result, cannot be improved using only spectral bounds on the Hessian of the target log-density.
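
For readers unfamiliar with the setup, the following is a minimal sketch of BBVI with the reparametrization gradient for a mean-field Gaussian family, which is one example of a sub-Gaussian location-scale family. The diagonal Gaussian target, the log-scale parametrization, and all hyperparameters are assumptions made for illustration. They are not the paper's experimental setup and say nothing about its rates.

```python
import math
import random
from typing import List, Tuple


def bbvi_mean_field(
    mu: List[float],
    sigma: List[float],
    steps: int = 5000,
    batch: int = 10,
    lr: float = 0.01,
    seed: int = 0,
) -> Tuple[List[float], List[float]]:
    """Fit q(z) = prod_i N(m_i, s_i^2) to an assumed target prod_i N(mu_i, sigma_i^2).

    Each sample is reparametrized as z_i = m_i + s_i * eps_i with eps_i ~ N(0, 1),
    so the ELBO gradient is an average of pathwise derivatives.
    The scale is stored as rho_i = log(s_i) to keep it positive.
    """
    rng = random.Random(seed)
    d = len(mu)
    m = [0.0] * d
    rho = [0.0] * d
    for _step in range(steps):
        grad_m = [0.0] * d
        grad_rho = [0.0] * d
        for _sample in range(batch):
            for i in range(d):
                eps = rng.gauss(0.0, 1.0)
                s = math.exp(rho[i])
                z = m[i] + s * eps
                score = -(z - mu[i]) / sigma[i] ** 2  # d/dz log p(z)
                grad_m[i] += score / batch
                # Chain rule dz/drho = s * eps, plus d/drho of the entropy (= 1).
                grad_rho[i] += (score * eps * s + 1.0) / batch
        for i in range(d):
            m[i] += lr * grad_m[i]  # stochastic gradient ascent on the ELBO
            rho[i] += lr * grad_rho[i]
    return m, [math.exp(r) for r in rho]


if __name__ == "__main__":
    location, scale = bbvi_mean_field([1.0, -2.0, 0.5], [1.0, 0.5, 2.0])
    print("fitted location:", location)
    print("fitted scale:   ", scale)
```

Because the assumed target is itself a product of Gaussians, the optimum is known in closed form (m = mu, s = sigma), which makes the sketch easy to sanity-check.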


Non-convex entropic mean-field optimization via Best Response flow

Neural Information Processing Systems

We study the problem of minimizing non-convex functionals on the space of probability measures, regularized by the relative entropy (KL divergence) with respect to a fixed reference measure, as well as the corresponding problem of solving entropy-regularized non-convex-non-concave min-max problems. We utilize the Best Response flow (also known in the literature as the fictitious play flow) and study how its convergence is influenced by the relation between the degree of non-convexity of the functional under consideration, the regularization parameter and the tail behaviour of the reference measure. In particular, we demonstrate how to choose the regularizer, given the non-convex functional, so that the Best Response operator becomes a contraction with respect to the L1-Wasserstein distance, which ensures the existence of its unique fixed point that is then shown to be the unique global minimizer for our optimization problem. This extends recent results where the Best Response flow was applied to solve convex optimization problems regularized by the relative entropy with respect to arbitrary reference measures, and with arbitrary values of the regularization parameter. Our results explain precisely how the assumption of convexity can be relaxed, at the expense of making a specific choice of the regularizer. Additionally, we demonstrate how these results can be applied in reinforcement learning in the context of policy optimization for Markov Decision Processes and Markov games with softmax parametrized policies in the mean-field regime.


Semi-infinite Nonconvex Constrained Min-Max Optimization

Neural Information Processing Systems

Semi-Infinite Programming (SIP) has emerged as a powerful framework for modeling problems with infinitely many constraints; however, its theoretical development in the context of nonconvex and large-scale optimization remains limited. In this paper, we investigate a class of nonconvex min-max optimization problems with nonconvex infinite constraints, motivated by applications such as adversarial robustness and safety-constrained learning. We propose a novel inexact dynamic barrier primal-dual algorithm and establish its convergence properties.


https://papers.nips.cc/paper_files/paper/2025/file/9a07bb7288caaea2ecc4c367188bc6db-Paper-Conference.pdf

Neural Information Processing Systems

Stochastic Natural Gradient Variational Inference (NGVI) is a widely used method for approximating posterior distributions in probabilistic models. Despite its empirical success and foundational role in variational inference, its theoretical underpinnings remain limited, particularly in the case of non-conjugate likelihoods. While NGVI has been shown to be a special instance of Stochastic Mirror Descent, and recent work has provided convergence guarantees using relative smoothness and strong convexity for conjugate models, these results do not extend to the non-conjugate setting, where the variational loss becomes non-convex and harder to analyze. In this work, we focus on mean-field parameterization and advance the theoretical understanding of NGVI in three key directions. First, we derive sufficient conditions under which the variational loss satisfies relative smoothness with respect to a suitable mirror map. Second, leveraging this structure, we propose a modified NGVI algorithm incorporating non-Euclidean projections and prove its global non-asymptotic convergence to a stationary point. Finally, under additional structural assumptions about the likelihood, we uncover hidden convexity properties of the variational loss and establish fast global convergence of NGVI to a global optimum. These results provide new insights into the geometry and convergence behavior of NGVI in challenging inference settings.


Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules

Neural Information Processing Systems

Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel intrinsic-time viewpoint, which captures the training progress more faithfully than iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule's influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs--constant, exponential decay, and warmup-stable-decay (WSD)--and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.
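
To make the three schedule families concrete, the sketch below implements constant, exponential-decay, and warmup-stable-decay schedules and accumulates the learning rate over steps. Treating intrinsic time as the running sum of learning rates is an assumption made here for illustration, since the abstract does not define it. The function names, the linear decay shape, and the default warmup and decay lengths are likewise hypothetical.

```python
from typing import Callable, List

Schedule = Callable[[int, int, float], float]


def constant_lr(step: int, total: int, peak: float) -> float:
    """Constant schedule: the peak learning rate at every step."""
    return peak


def exp_decay_lr(step: int, total: int, peak: float, final_ratio: float = 0.1) -> float:
    """Exponential decay from peak down to peak * final_ratio at the last step."""
    return peak * final_ratio ** (step / max(total - 1, 1))


def wsd_lr(
    step: int,
    total: int,
    peak: float,
    warmup_steps: int = 10,
    decay_steps: int = 20,
    final_ratio: float = 0.1,
) -> float:
    """Warmup-stable-decay: linear warmup, flat plateau, then linear decay."""
    decay_start = total - decay_steps
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    if step < decay_start:
        return peak
    progress = (step - decay_start + 1) / decay_steps
    return peak * (1.0 - (1.0 - final_ratio) * progress)


def intrinsic_time(schedule: Schedule, total: int, peak: float) -> List[float]:
    """Running sum of learning rates (assumed stand-in for intrinsic time)."""
    elapsed = 0.0
    trajectory = []
    for step in range(total):
        elapsed += schedule(step, total, peak)
        trajectory.append(elapsed)
    return trajectory


if __name__ == "__main__":
    for name, schedule in [("constant", constant_lr), ("exp", exp_decay_lr), ("wsd", wsd_lr)]:
        print(name, round(intrinsic_time(schedule, 100, 0.1)[-1], 3))
```

Under this reading, two schedules with the same number of iterations can cover different amounts of intrinsic time, which is one way to see why iteration count alone may be a poor measure of training progress.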


Accelerating Model-Free Optimization via Averaging of Cost Samples

Neural Information Processing Systems

Model-free optimization methods typically rely on cost samples gathered by perturbing the current solution estimate along a finite and fixed set of directions. However, at each iteration, only the current cost samples are used, while potentially informative, previously collected samples are discarded. In this work, we challenge this conventional approach by introducing a simple yet effective memory mechanism that maintains an auxiliary vector of iteratively updated cost samples. By leveraging this stored information, our method estimates descent directions as an average of all perturbation directions, weighted by the components of the auxiliary vector.